We construct a semiparametric estimator in case-control studies where the gene and the environment
are assumed to be independent. A discrete or continuous parametric distribution of the
genes is assumed in the model. A discrete distribution of the genes can be used to model the mutation
or presence of a certain group of genes. A continuous distribution allows the distribution of
the gene effects to be in a finite-dimensional parametric family and can hence be used to model 
the gene expression levels. We leave the distribution of the environment totally unspecified. The 
estimator is derived through calculating the efficiency score function in a hypothetical setting 
where a close approximation to the samples is random. The resulting estimator is proved to be 
efficient in the hypothetical situation. The efficiency of the estimator is further demonstrated to 
hold in the case-control setting as well. 

Keywords: case-control study; gene-environment interaction; logistic regression; semiparametric 
efficiency 

1. Introduction 

Case-control designs are frequently implemented in clinical studies where, instead of tak- 
ing a random sample of a mixed population of both cases and non-cases, a fixed number 
of cases and a fixed number of controls are randomly sampled from the respective pop- 
ulations of cases and non-cases. Because the resulting samples are no longer random or 
independently and identically distributed (i.i.d.), the classical large-sample asymptotic 
theories could fail to apply. In the literature, two main approaches are taken in order 
to adapt the large-sample theory to the case-control setting. The first approach is highlighted
in Breslow et al. (2000), where a modified design of the usual case-control study
is proposed. The resulting random sample is then linked to the true case-control sample 
through using results from McNeney (1998), where the similarity between random and 
non-random sample asymptotic properties is developed by almost establishing the whole 
asymptotic theory under non-i.i.d. samples. The second approach is somewhat more direct
and is implicitly used by Rabinowitz (2000). Instead of treating the indicator (D)
of case/control as a random variable, D is assumed to be known and all the calculations
are performed conditionally on D. Although it does result in the conditional randomness
of the case-control samples, the resulting data is not really identically distributed.
Specifically, two different distributions are involved and the large-sample theory is still 
not available. Strictly speaking, the asymptotic theory for non-i.i.d. data rederived in 
McNeney (1998) also needs to be applied in order to treat such a combination of two 
sample cases. 

In addition to the complexity arising from a case-control design, the problem consid- 
ered in this article is also a semiparametric model problem, whose efficient estimator has 
not yet been explored even in the i.i.d. data situation. Specifically, the problem is as 
follows. Suppose that in the general population, the occurrence of a disease ($D = 1$) follows
a logistic model $\operatorname{logit}\{\Pr(D = 1)\} = m(G, E)$, where $G$ represents a person's genetic
character and $E$ represents the environmental elements. Further, suppose that $G$ and $E$
are independent of each other and that we are interested in the effect of gene, environment
and their interaction on the disease status. Thus, $m(g, e) = \beta_c + \beta_1 g + \beta_2 e + \beta_3 ge$.
The parametric form of the distribution of gene $g$ is assumed to be known as $q(g, \beta_4)$,
where $\beta_4$ is an unknown finite-dimensional parameter. The distribution of the environment,
$\eta(e)$, is unspecified. A special version of this problem is considered in Chatterjee
and Carroll (2005), where $q(g, \beta_4)$ is assumed to be a discrete distribution. There, the
authors derived a profile maximum likelihood estimator for $\beta = (\beta_c, \beta_1, \beta_2, \beta_3, \beta_4)^T$ and
showed that it is root-$N$ consistent, where $N$ is the size of the combined samples. The
estimator is later extended to a more general framework in Spinka et al. (2005). How- 
ever, it is not investigated whether the estimator achieves the optimal semiparametric 
efficiency. 

In this paper, we first establish in Section 2 that the classical semiparametric theory 
of Bickel et al. (1993) is applicable in general case-control studies, without having to
rederive the theory in parallel or having to resort to the results from McNeney (1998). 
Such first order asymptotic equivalence between case-control sampling and random sam- 
pling is a new result. We then proceed to compute the semiparametric efficient score and 
construct a semiparametric estimator for $\beta$ in Section 3. The computation is carried out
in a hypothetical population described in Section 2. This differs from the real population
from which the cases and controls are drawn. Hence, the derivation has its own interest
and novelty. In Section 3, we also prove that although the estimation of the nuisance
parameter $\eta$ is bypassed in our estimator, the resulting semiparametric estimator still
achieves the optimal efficiency. The proof and treatment are rather non-standard. Numerical
examples are included in Section 4 to demonstrate the performance of the proposed
estimator. The performance of the method in the discrete gene model is close to that of
the method in Chatterjee and Carroll (2005) and we point out the possible equivalence
between the two methods in Section 5. Some analytical derivations and technical details
are included in the Appendix.

2. Case-control data versus i.i.d. data

The samples from a case-control study are not random because the disease status is not
random. In general, the design randomly samples $N_1$ individuals from the case population
and $N_0$ from the non-case population. However, let us consider a hypothetical population
of interest with infinite population size, in which the disease to non-disease ratio is fixed
at $\pi = N_1/N_0$. Here, the reason for introducing the notion of hypothetical population is
to be able to use the classical semiparametric theory for i.i.d. data, developed in Bickel
et al. (1993). If the sample of size $N = N_0 + N_1$ from a case-control study happens to
be a random sample from the hypothetical population of interest, then we have a size-$N$
i.i.d. random sample and the usual semiparametric analysis will apply. The asymptotic
results hold when $N \to \infty$ and $\pi$ stays fixed.

Of course, the problem is that a random sample of size $N$ from the hypothetical population
of interest does not have to have exactly $N_0$ controls and $N_1$ cases, hence we
cannot immediately equate a case-control sample and a random sample from the hypothetical
population. In general, the number of controls/cases of a random sample from
the hypothetical population will have a binomial distribution $N_d^r \sim \mathrm{Binomial}(N, N_d/N)$,
$d = 0, 1$, which is very close to a normal distribution when $N$ is large, that is,
$(N_d^r - N_d)/\sqrt{N\pi(1 - \pi)} \to \mathrm{Normal}(0, 1)$ in distribution when $N \to \infty$. Here, the
superscript $r$ stands for 'random.' Furthermore, the probability of having $|N_d^r - N_d| > N^{2/3}$
goes to zero when $N \to \infty$. Thus, we could think of the case-control sample as obtained
by randomly picking a size-$N$ sample from the hypothetical population of interest, then
deleting a random $o_p(N^{2/3})$ cases (controls) and adding a random $o_p(N^{2/3})$ controls
(cases). Or, alternatively, we can think of the case-control sample as a random sample
of size $N$, but with a randomly chosen $o_p(N^{2/3})$ data contaminated in a particular
way. This "particular" contamination implies the following three properties: (i) the
contamination happens only to $o_p(N)$ of the observations (in the case-control samples, the
contamination in fact only happens to $o_p(N^{2/3})$ observations, but, in general, $o_p(N)$ is
already sufficient for our further analysis); (ii) the contaminated data is still of order $O(1)$,
that is, $|X_i^c - X_i|$ is bounded in probability for $i = 1, \ldots, N$; (iii) the zero expectation
holds for the contaminated observations, that is, if an estimating equation for $\beta$ of the
form $\sum_{i=1}^{N} f(X_i; \beta) = 0$ satisfies $E\{f(X_i; \beta_0)\} = 0$, then $E\{f(X_i^c; \beta_0)\} = 0$ as well. Here,
$X_i$, $i = 1, \ldots, N$, are i.i.d. random samples, the superscript $c$ stands for 'contaminated'
and the subscript $0$ represents the true parameter value.
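
The claim that $|N_d^r - N_d| > N^{2/3}$ becomes increasingly unlikely can be checked numerically. The following is a minimal sketch, not part of the original derivation: it computes the exact binomial tail probability for a few sample sizes, assuming equal numbers of cases and controls so that the binomial success probability is $N_1/N = 0.5$. The function names and the chosen sample sizes are illustrative assumptions.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, via log-gamma for stability."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def tail_probability(n, p):
    """Exact P(|K - n*p| > n**(2/3)) for K ~ Binomial(n, p)."""
    threshold = n ** (2.0 / 3.0)
    return sum(math.exp(log_binom_pmf(k, n, p))
               for k in range(n + 1)
               if abs(k - n * p) > threshold)


# Assumption for illustration: N_1 = N_0, so the case probability is 0.5.
sizes = [10, 100, 1000, 10000]
tails = [tail_probability(n, 0.5) for n in sizes]
for n, t in zip(sizes, tails):
    print(n, t)
```

The printed probabilities shrink rapidly as the sample size grows, which is the behaviour the argument relies on.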

When the case-control sample is viewed as a contaminated random sample from the 
hypothetical population of interest, the first two "particular" properties certainly hold. 
For the estimator we will construct, we shall demonstrate that the third property also 
holds. Thus, if we can show that the same first order asymptotics apply to both the 
i.i.d. sample of size N and its contaminated version as long as the three properties hold, 
then we can treat the case-control sample as an i.i.d. sample. 

The argument is as follows. Assume that we mistakenly treated the contaminated data
as i.i.d. and obtained an efficient estimator:

$$\sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \beta) = 0. \qquad (2.1)$$

Here, $S_{\mathrm{eff}}$ is the efficient score function and its derivation is model-dependent. One obvious
aspect of $S_{\mathrm{eff}}$ worth emphasizing is that the construction of $S_{\mathrm{eff}}$ does not depend on
the observations. Regardless of the method of derivation, the efficient score function $S_{\mathrm{eff}}$
has the property $E\{S_{\mathrm{eff}}(X_i; \beta_0)\} = 0$. If we had the uncontaminated data, our subsequent
estimating equation for $\beta$ would have been $\sum_{i=1}^{N} S_{\mathrm{eff}}(X_i; \beta) = 0$. Working with the contaminated
data, (2.1) is the estimating equation we really have. Suppose that $\hat\beta$ solves (2.1). We
then have

$$0 = \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \hat\beta) = \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \beta_0) + \sum_{i=1}^{N} \frac{\partial S_{\mathrm{eff}}(X_i^c; \beta^*)}{\partial \beta^T} (\hat\beta - \beta_0),$$

therefore,

$$-\left[ N^{-1} \sum_{i=1}^{N} \frac{\partial S_{\mathrm{eff}}(X_i^c; \beta^*)}{\partial \beta^T} \right] \sqrt{N} (\hat\beta - \beta_0) = N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \beta_0), \qquad (2.2)$$

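where $\beta^*$ lies on the line connecting $\beta_0$ and $\hat\beta$. Note that in our "particular" contamination
requirement, only $o_p(N)$ terms yield a different $X_i$ from $X_i^c$ (requirement (i)) and, for
each $X_i^c \neq X_i$, the difference is $O_p(1)$ (requirement (ii)), so we have

$$N^{-1} \sum_{i=1}^{N} \frac{\partial S_{\mathrm{eff}}(X_i^c; \beta^*)}{\partial \beta^T} = N^{-1} \sum_{i=1}^{N} \frac{\partial S_{\mathrm{eff}}(X_i; \beta^*)}{\partial \beta^T} + o_p(1) = E\left\{ \frac{\partial S_{\mathrm{eff}}(X_i; \beta_0)}{\partial \beta^T} \right\} + o_p(1). \qquad (2.3)$$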

From the third "particular" property, we have $E\{S_{\mathrm{eff}}(X_i^c; \beta_0)\} = 0$ (we will prove that
this property holds for the case-control data in Section 3). In conjunction with the fact
that only $o_p(N)$ of the terms $S_{\mathrm{eff}}(X_i^c; \beta_0) - S_{\mathrm{eff}}(X_i; \beta_0)$ are non-zero, we can further
obtain

$$N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \beta_0) = N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i; \beta_0) + o_p(1). \qquad (2.4)$$

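The detailed argument of (2.4) is the following. Suppose for the first $l = o_p(N)$ observations,
$X_i^c \neq X_i$. Then we have

$$\begin{aligned}
N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i^c; \beta_0)
&= N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i; \beta_0) + N^{-1/2} \sum_{i=1}^{l} \{ S_{\mathrm{eff}}(X_i^c; \beta_0) - S_{\mathrm{eff}}(X_i; \beta_0) \} \\
&= N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i; \beta_0) + (N/l)^{-1/2}\, l^{-1/2} \sum_{i=1}^{l} \{ S_{\mathrm{eff}}(X_i^c; \beta_0) - S_{\mathrm{eff}}(X_i; \beta_0) \}.
\end{aligned}$$

Note that $S_{\mathrm{eff}}(X_i^c; \beta_0) - S_{\mathrm{eff}}(X_i; \beta_0)$ has mean zero, hence
$l^{-1/2} \sum_{i=1}^{l} \{ S_{\mathrm{eff}}(X_i^c; \beta_0) - S_{\mathrm{eff}}(X_i; \beta_0) \} = O_p(1)$. From $l = o_p(N)$, we obtain the result in (2.4) immediately. Thus,
plugging (2.3) and (2.4) into (2.2), we obtain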



$$\sqrt{N} (\hat\beta - \beta_0) = -\left[ E\left\{ \frac{\partial S_{\mathrm{eff}}(X_i; \beta_0)}{\partial \beta^T} \right\} \right]^{-1} N^{-1/2} \sum_{i=1}^{N} S_{\mathrm{eff}}(X_i; \beta_0) + o_p(1).$$

The above display is exactly the first order asymptotic expansion of the estimator for
$\beta$ if we had performed the estimation procedure on the uncontaminated data. Thus,
we have demonstrated that the estimator obtained from contaminated data performs as
well as the one obtained from uncontaminated data in terms of first order asymptotic
properties. Note that the efficient estimator can be replaced by a consistent estimator,
say, one based on a general $S$ instead of $S_{\mathrm{eff}}$, as long as $E(S \mid D = d) = 0$ holds for $d = 0, 1$. This
ensures that $E\{S(X_i^c)\} = 0$ as long as $E\{S(X_i)\} = 0$ (shown in Section 3), so the above
derivation will still carry through. Hence, the asymptotic property of the estimator using
the contaminated data is indeed the same as if we had the uncontaminated data. Thus,
the case-control data can be treated as i.i.d. data and we can achieve the same efficiency
as when the data was indeed i.i.d. In other words, a semiparametric estimator using
contaminated data is at least as efficient as one using the uncontaminated data.
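
The effect described by (2.4) can be illustrated with a small simulation. The sketch below is an illustration under simplifying assumptions, not the estimator of this paper: the "score" is taken to be a standard normal observation itself, and the contamination replaces the first $l = \lfloor N^{0.6} \rfloor$ observations by independent mean-zero draws, so that requirements (i) to (iii) hold by construction. The function name and the exponent 0.6 are hypothetical choices.

```python
import math
import random


def normalized_sums(n, frac_exponent=0.6, seed=1):
    """Return N^{-1/2} * sum for a clean sample and its contaminated version.

    The first l = floor(n ** frac_exponent) observations are replaced by
    independent mean-zero draws, so l = o(n) and the zero-mean property holds.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    l = int(n ** frac_exponent)
    x_c = list(x)
    for i in range(l):
        x_c[i] = rng.gauss(0.0, 1.0)  # contaminated observation, still mean zero
    clean = sum(x) / math.sqrt(n)
    contaminated = sum(x_c) / math.sqrt(n)
    return clean, contaminated, l


clean, contaminated, l = normalized_sums(200000)
# The difference has standard deviation sqrt(2 * l / n), which vanishes as n grows.
print(l, clean, contaminated, contaminated - clean)
```

In this toy setting the two normalized sums differ by a term whose standard deviation is $\sqrt{2l/N}$, which is the $(N/l)^{-1/2} O_p(1)$ term in the detailed argument for (2.4).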

One question still remains: can we do even better than in the i.i.d. data case? In 
fact, since case-control sampling is designed to be an efficient way to collect covariate 
information, it seems to contain more information than a random sample. However, we 
claim that for asymptotically linear estimators of the form

$$\sqrt{N} (\hat\beta - \beta_0) = N^{-1/2} \sum_{i=1}^{N} \psi(X_i^c; \beta_0) + o_p(1),$$

where $E\{\psi(X_i^c; \beta_0) \mid d\} = 0$, the efficiency in parameter estimation cannot be further improved
by taking into account the special sampling procedure. This is because otherwise,
we could have obtained a better estimator for the i.i.d. sample as well, by replacing
$X_i^c$ with $X_i$. The detailed derivation is the same as in the above paragraph, where the
condition $E\{\psi(X_i^c; \beta_0) \mid d\} = 0$ implies $E\{\psi(X_i; \beta_0) \mid d\} = 0$ for case-control data, which
ensures $E\{\psi(X_i^c; \beta_0)\} = E\{\psi(X_i; \beta_0)\} = 0$. Of course, if the condition $E(\psi \mid d) = 0$ is not
satisfied, the argument does not work. However, we now show that if $\psi$ achieves the
optimal variance for the case-control data $X_i^c$, then it has to satisfy $E\{\psi(X_i^c; \beta_0) \mid d\} = 0$.

First, $E\{\partial E(\psi \mid D)/\partial \beta\} = \partial E(\psi)/\partial \beta = 0$ because the probability density function
(p.d.f.) of $D$ does not contain $\beta$. If we let $\tilde\psi(X_i^c) = \psi(X_i^c) - E\{\psi(X_i^c) \mid d\}$, then
$E\{\tilde\psi(X_i^c)\} = 0$ and $E\{\partial \tilde\psi(X_i^c)/\partial \beta\} = E\{\partial \psi(X_i^c)/\partial \beta\}$. If $E\{\psi(X_i^c) \mid d\} \neq 0$, then we can
obtain

$$\mathrm{var}\{\psi(X_i^c)\} = E[\mathrm{var}\{\psi(X_i^c) \mid D\}] + \mathrm{var}[E\{\psi(X_i^c) \mid D\}] = \mathrm{var}\{\tilde\psi(X_i^c)\} + \mathrm{var}[E\{\psi(X_i^c) \mid D\}] > \mathrm{var}\{\tilde\psi(X_i^c)\},$$

which, together with $E\{\partial \tilde\psi(X_i^c)/\partial \beta\} = E\{\partial \psi(X_i^c)/\partial \beta\}$, contradicts the fact that $\psi(X_i^c)$
is optimal.

In summary, we have shown that the case-control samples can be treated as if they
were i.i.d. and all the first order asymptotic results for i.i.d. data will be inherited for
case-control data as well. We can see that the above establishment is similar to the development
in Breslow et al. (2000). However, one prominent difference is that in Breslow et
al. (2000), the case-control sample is viewed as the result of a biased sampling procedure
with fixed subsample size, hence they cannot use the classical semiparametric theory for
i.i.d. data, but have to refer to McNeney (1998) for the theoretical properties, where the
whole semiparametric theory for fixed-size subsamples is established in parallel to the 
i.i.d. framework. Here, through introducing the notion of hypothetical population and by 
analyzing the first order equivalence between a random sample and a sample with fixed- 
size subsamples, we can easily contain the case-control problem in the usual i.i.d. model 
framework. The derivation is much simpler and more elegant. Thus, in the remainder of 
the paper, we ignore the case-control nature of the data and proceed with our analysis by 
pretending the data is i.i.d. from the aforementioned hypothetical population of interest. 

3. A semiparametric efficient estimator 
3.1. Geometric approach 

A random sample from the hypothetical population of interest has p.d.f.

$$\begin{aligned}
p(g, e, d; \beta, \eta) &= p_D(d)\, p_{G,E \mid D}(g, e \mid d) = p_D(d)\, p^t_{G,E \mid D}(g, e \mid d) \\
&= p_D(d)\, p^t_G(g)\, p^t_E(e)\, p^t_{D \mid G,E}(d \mid g, e) / p^t_D(d) \qquad (3.1) \\
&= \frac{N_d}{N} \frac{q(g)\, \eta(e)\, H(d, g, e)}{p^t_D(d)}.
\end{aligned}$$

Here, the superscript $t$ stands for the p.d.f. in the true population, whereas expressions
without superscripts, including various p.d.f.'s and expectation $E$, are quantities in the
hypothetical population of interest; $\eta(e) = p^t_E(e)$ is the unknown infinite-dimensional
parameter and

$$\begin{aligned}
H(d, g, e; \beta) &= \exp[d\{m(g, e)\}] / [1 + \exp\{m(g, e)\}] \\
&= \exp\{d(\beta_c + \beta_1 g + \beta_2 e + \beta_3 ge)\} / \{1 + \exp(\beta_c + \beta_1 g + \beta_2 e + \beta_3 ge)\}; \\
p^t_D(d; \beta, \eta) &= \int q(g, \beta_4)\, \eta(e)\, H(d, g, e; \beta)\, d\mu(g)\, d\mu(e).
\end{aligned}$$

We recognize that estimating the finite-dimensional parameter $\beta$ in the presence of an
infinite-dimensional nuisance parameter $\eta$, using an i.i.d. sample of size $N = N_0 + N_1$ from
a hypothetical population of interest, with the p.d.f. of a random observation given by
(3.1), is a classical semiparametric problem. Therefore, we implement the semiparametric
estimation methods to derive the semiparametric efficient estimator. The approach we 
take is geometric, first introduced in Bickel et al. (1993). Because the general approach 
and related concepts have been nicely described in several recent papers including Tsiatis 
and Ma (2004), Allen et al. (2005), Ma et al. (2005) and Ma and Tsiatis (2006), here, 
we only briefly outline the general approach and the definition of the relevant concepts, 
referring the reader to these papers for more detailed descriptions. 

In general semiparametric problems, one approach to construct estimators for $\beta$ is to
obtain some influence function $\phi(X_i; \beta, \eta)$ which is subsequently used to form estimating
equations for $\beta$ in the form of $\sum_{i=1}^{N} \phi(X_i; \beta, \eta) = 0$. Here, $X_i = (G_i, E_i, D_i)$, $i = 1, \ldots, N$,
are i.i.d. observations. The solution of the estimating equation, $\hat\beta$, is subsequently a
semiparametric estimator and its variance has been established to be equal to the variance
of $\phi(X_i; \beta, \eta)$. Consequently, the optimal estimator among the class of all such estimators
is the one whose influence function has the smallest variance. This is usually referred to
as the semiparametric efficient estimator.

The geometric approach considers the space in which all influence functions belong.
Specifically, one considers a Hilbert space $\mathcal{H}$ which consists of all zero-mean measurable
functions with finite variance and the same dimension as $\beta$. The inner product in $\mathcal{H}$ is
defined as the covariance. The Hilbert space $\mathcal{H}$ is further decomposed into two spaces,
the nuisance tangent space $\Lambda$ and its orthogonal complement $\Lambda^\perp$.

To understand the nuisance tangent space $\Lambda$, consider first the case where the nuisance
parameter, denoted $\gamma$, is finite-dimensional. Then, the nuisance score function,
$S_\gamma = \partial \log p(X_i; \beta, \gamma)/\partial \gamma$, spans a linear space, which is denoted $\Lambda$. In the case of the
infinite-dimensional nuisance parameter $\eta$, the corresponding $\Lambda$ is defined as the mean
squared closure of the span of all the nuisance score functions $S_\gamma$, where $p(X_i; \beta, \gamma)$ is any
parametric submodel of $p(X_i; \beta, \eta)$. The orthogonal complement of $\Lambda$ in $\mathcal{H}$ is subsequently
defined as $\Lambda^\perp$.

Any function in $\Lambda^\perp$ can be properly normalized to obtain a valid influence function.
On the other hand, every influence function is a function in $\Lambda^\perp$. Among all these functions,
the projection of the score function $S_\beta = \partial \log p(X_i; \beta, \eta)/\partial \beta$ results in the efficient
influence function. If we denote the projection by $S_{\mathrm{eff}}$, then the corresponding optimal
variance is $\mathrm{var}(S_{\mathrm{eff}})^{-1}$. The projection $S_{\mathrm{eff}}$ is usually called the efficient score function.

Hence, the geometric approach converts the problem of searching for efficient semiparametric
estimators to the problem of calculating $S_{\mathrm{eff}}$.
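
For intuition, the projection can be written explicitly when the nuisance parameter $\gamma$ is finite-dimensional. The following is the standard least-squares projection formula for that parametric case, given here only as an illustration and not as part of the derivation for the infinite-dimensional $\eta$; the information blocks $I_{\beta\beta}$, $I_{\beta\gamma}$, $I_{\gamma\gamma}$ are notation introduced solely for this sketch.

```latex
% Finite-dimensional nuisance: \Lambda is the linear span of S_\gamma, so the
% projection of S_\beta onto \Lambda^\perp is the residual of a least-squares fit.
S_{\mathrm{eff}} = S_\beta - E(S_\beta S_\gamma^T)\,\{E(S_\gamma S_\gamma^T)\}^{-1} S_\gamma ,
\qquad
\mathrm{var}(S_{\mathrm{eff}}) = I_{\beta\beta} - I_{\beta\gamma} I_{\gamma\gamma}^{-1} I_{\gamma\beta},
% where I_{\beta\beta} = E(S_\beta S_\beta^T), I_{\beta\gamma} = E(S_\beta S_\gamma^T) = I_{\gamma\beta}^T,
% and I_{\gamma\gamma} = E(S_\gamma S_\gamma^T).
```

In this parametric case the optimal variance $\mathrm{var}(S_{\mathrm{eff}})^{-1}$ reduces to the familiar inverse of the partial information for $\beta$.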

3.2. Construction of the estimator 

Following the description in Section 3.1, we obtain the efficient score function $S_{\mathrm{eff}}$. Viewing
the sample as random from the hypothetical population, the p.d.f. in (3.1) is no
longer in a simple multiplicative form, in that the nuisance parameter appears both in
the numerator and in the integral in the denominator. Since this implies that the nuisance
tangent space is not automatically orthogonal to the score functions, the related
computation for the nuisance tangent space and associated objects is more involved. In
addition, one needs to be aware that the calculation should be carried out with respect to
the hypothetical population, hence quantities such as $p^t_G, p^t_E, p^t_D$ need to be treated with
extra care and not confused with $p_G, p_E, p_D$. The main steps of the derivation are as follows.
We first calculate the score function $S_\beta$ by taking the derivative of $\log p(g, e, d; \beta, \eta)$
with respect to $\beta$. This results in $S_\beta = S - E(S \mid d)$, where

$$S = \frac{\partial \log\{ q(g, \beta_4)\, H(d, g, e; \beta) \}}{\partial \beta}.$$

We then calculate the two spaces $\Lambda$, $\Lambda^\perp$ by replacing $\eta$ in (3.1) with a finite-dimensional
parameter $\gamma$, taking the derivative of $\log p(g, e, d; \beta, \gamma)$ with respect to $\gamma$ to obtain $S_\gamma$,
hypothesizing a space of all such $S_\gamma$ and proving that $\Lambda$ is equivalent to this space. The
results are

$$\begin{aligned}
\Lambda &= [h(e) - E\{h(e) \mid d\} : \forall h(e) \text{ such that } E^t(h) = 0] = [h(e) - E\{h(e) \mid d\} : \forall h(e)], \\
\Lambda^\perp &= [h(g, e, d) : E(h \mid e) = E\{E(h \mid d) \mid e\}].
\end{aligned}$$

We finally project the score vector $S_\beta$ onto $\Lambda^\perp$ to obtain $S_{\mathrm{eff}} = S_\beta - f(e) + E(f \mid d) =
S - E(S \mid d) - f(e) + E(f \mid d)$, where $f(e) - E(f \mid d)$ represents the projection of $S_\beta$ onto $\Lambda$.
The details of the derivation can be found in the Appendix. Note that this form of $S_{\mathrm{eff}}$
implies that $E\{S_{\mathrm{eff}}(X) \mid d\} = 0$. When $X$ is replaced by $X^c$, the non-random case-control
sample, we still have $E\{S_{\mathrm{eff}}(X^c) \mid d\} = 0$ because the design itself guarantees that the
only non-random component is $d$, which is held constant. Thus, viewing $X^c$ as a special
contaminated version of $X$, we still have $E\{S_{\mathrm{eff}}(X^c)\} = 0$, which is required in Section 2.
From the Appendix, we can further write

$$S_{\mathrm{eff}} = S - E(S \mid e) + (-1)^d \{a(0) - a(1)\}\, w(e, 1 - d), \qquad (3.2)$$

where $a(0) - a(1) = E(f \mid D = 0) - E(S \mid D = 0) - E(f \mid D = 1) + E(S \mid D = 1)$.

In terms of the calculation of $S_{\mathrm{eff}}$, note that $S$, $E(S \mid e)$ and $w$, as given in (A.1), are all
functions with parameters $\beta$ and $p^t_D(d)$ only. Hence, as long as we can calculate $p^t_D(d)$,
we will have the ability to evaluate $S$, $E(S \mid e)$ and $w$. The computation of $a(0) - a(1)$
requires further arguments.

In the following, we first obtain an approximation of $p^t_D(d)$, then pursue the estimation
of $a(0) - a(1)$. To estimate $p^t_D(d)$, using $p_E(e)$ to denote the probability density function
of $e$ in the hypothetical population, we observe that

$$\frac{N_d}{N} = E_e \left[ \frac{N_d \int q(g, \beta_4) H(d, g, e)\, d\mu(g) / p^t_D(d)}{\sum_d N_d \int q(g, \beta_4) H(d, g, e)\, d\mu(g) / p^t_D(d)} \right].$$

Replacing the moment $E_e$ with its sample moment through averaging across different
observed $e_i$'s, we obtain

$$\frac{N_d}{N} \approx N^{-1} \sum_{i=1}^{N} \frac{N_d \int q(g, \beta_4) H(d, g, e_i)\, d\mu(g) / p^t_D(d)}{\sum_d N_d \int q(g, \beta_4) H(d, g, e_i)\, d\mu(g) / p^t_D(d)}. \qquad (3.3)$$

Note that the above two equations (for $d = 0$ and $d = 1$) are not independent; one determines the other. But,
in combination with $p^t_D(0) + p^t_D(1) = 1$, we can estimate $p^t_D(d)$ completely. Because the
only approximation involved in estimating $p^t_D(d)$ is replacing the mean with a sample
mean, the calculation will produce a root-$N$-consistent estimator for $p^t_D(0)$ and $p^t_D(1)$.
We denote the estimators by $\hat p^t_D(0)$ and $\hat p^t_D(1)$. In calculating $N_d/N$, we write $p(g, e, d)$ as
$p_E(e)\, p_{D,G \mid E}(d, g \mid e)$, instead of directly using the form in (3.1). Since $p_E(e)$ is the p.d.f. of
the environment variable in the hypothetical population, this enables us to replace the
expectation $E_e$ with the average of the samples.

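The estimation of $a(0) - a(1)$ is much more tedious, and involves an almost brute
force calculation of $E(S \mid d)$ and $E(f \mid d)$. If we let $b_0 = E(S \mid D = 0)$, $b_1 = E(S \mid D = 1)$, $c_0 =
E(f \mid D = 0)$ and $c_1 = E(f \mid D = 1)$, then $a(0) - a(1) = b_1 - b_0 + c_0 - c_1$. The calculation of
$b_0$ and $b_1$ follows from

$$\begin{aligned}
b_d &= \frac{\int S\, p_{D,G,E}(d, g, e)\, d\mu(g)\, d\mu(e)}{\int p_{D,G,E}(d, g, e)\, d\mu(g)\, d\mu(e)}
 = \frac{\int S\, p_E(e)\, p_{D,G \mid E}(d, g \mid e)\, d\mu(g)\, d\mu(e)}{\int p_E(e)\, p_{D,G \mid E}(d, g \mid e)\, d\mu(g)\, d\mu(e)} \\
&= \int p_E(e) \frac{\int S N_d\, q(g) H(d, g, e)\, d\mu(g) / p^t_D(d)}{\sum_d \int N_d\, q(g) H(d, g, e)\, d\mu(g) / p^t_D(d)}\, d\mu(e)
 \Bigg/ \int p_E(e) \frac{\int N_d\, q(g) H(d, g, e)\, d\mu(g) / p^t_D(d)}{\sum_d \int N_d\, q(g) H(d, g, e)\, d\mu(g) / p^t_D(d)}\, d\mu(e).
\end{aligned}$$

Since $S$ can be calculated directly, we simply obtain the approximation of $b_d$, $d = 0, 1$, by
replacing the mean with sample mean and plugging in the estimated $\hat p^t_D(d)$:

$$\hat b_0 = \sum_{i=1}^{N} \frac{\int S(0, g, e_i)\, q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}
 \Bigg/ \sum_{i=1}^{N} \frac{\int q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}, \qquad (3.4)$$

$$\hat b_1 = \sum_{i=1}^{N} \frac{\int S(1, g, e_i)\, q(g) H(1, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}
 \Bigg/ \sum_{i=1}^{N} \frac{\int q(g) H(1, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}. \qquad (3.5)$$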
The calculations of $c_0$ and $c_1$ are a bit more tricky. Since

$$f = E(S \mid e) + (c_0 - b_0)\, w(e, 0) + (c_1 - b_1) \{1 - w(e, 0)\},$$

taking expectation conditional on, say $D = 0$, we have

$$c_0 = E\{E(S \mid e) \mid D = 0\} + (c_0 - b_0)\, E\{w(e, 0) \mid D = 0\} + (c_1 - b_1) [1 - E\{w(e, 0) \mid D = 0\}]$$

or, equivalently, we obtain

$$c_0 - c_1 = \frac{E\{E(S \mid e) \mid D = 0\} - b_0 E\{w(e, 0) \mid D = 0\} - b_1 [1 - E\{w(e, 0) \mid D = 0\}]}{1 - E\{w(e, 0) \mid D = 0\}}.$$

Hence, replacing mean by sample mean and using $\hat p^t_D(d)$, $c_0 - c_1$ is estimated by

$$\hat c_0 - \hat c_1 = \frac{\hat E\{E(S \mid e) \mid D = 0\} - \hat b_0 \hat E\{w(e, 0) \mid D = 0\} - \hat b_1 [1 - \hat E\{w(e, 0) \mid D = 0\}]}{1 - \hat E\{w(e, 0) \mid D = 0\}}, \qquad (3.6)$$

where

$$\hat E\{w(e, 0) \mid D = 0\} = \sum_{i=1}^{N} \frac{w(e_i, 0) \int q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}
 \Bigg/ \sum_{i=1}^{N} \frac{\int q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)} \qquad (3.7)$$

and

$$\hat E\{E(S \mid e) \mid D = 0\} = \sum_{i=1}^{N} \frac{E(S \mid e_i) \int q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}
 \Bigg/ \sum_{i=1}^{N} \frac{\int q(g) H(0, g, e_i)\, d\mu(g)}{\sum_d \int N_d\, q(g) H(d, g, e_i)\, d\mu(g) / \hat p^t_D(d)}. \qquad (3.8)$$

Similarly to the estimation of $p^t_D(d)$, the only approximation involved in obtaining
$\hat b_0$, $\hat b_1$ and $\hat c_0 - \hat c_1$ is replacing mean by sample mean, so $a(0) - a(1)$ is estimated
using $\hat a(0) - \hat a(1) = \hat b_1 - \hat b_0 + \hat c_0 - \hat c_1$ at the root-$N$ rate.

We would like to emphasize that in all of the above calculations, when we replace the
expectation with the sample average, we use the result that the case-control sample can
be treated as a random sample from the hypothetical population. Hence, for any function
$u(e)$, the approximation $N^{-1} \sum_{i=1}^{N} u(e_i)$ can only be used to replace $\int u(e)\, p_E(e)\, d\mu(e)$,
not $\int u(e)\, \eta(e)\, d\mu(e)$.
We omitted the parameter $\beta$ in all of the above expressions; in fact, $p^t_D(0)$, $p^t_D(1)$, $a(0) -
a(1)$ are all functions of $\beta$. However, if we replace $\beta$ with $\tilde\beta$, an initial estimator of
$\beta$, we will still obtain $\hat p^t_D(d; \tilde\beta)$, $\hat a(0; \tilde\beta) - \hat a(1; \tilde\beta)$ that are root-$N$-consistent, as long as
$\tilde\beta - \beta = O_p(N^{-1/2})$. The final estimating equation of $\beta$ has the form

$$\sum_{i=1}^{N} \hat S_{\mathrm{eff}}(x_i; \beta) = \sum_{i=1}^{N} S_{\mathrm{eff}}\{x_i; \beta, \hat p^t_D(d; \tilde\beta), \hat a(0; \hat p^t_D, \tilde\beta) - \hat a(1; \hat p^t_D, \tilde\beta)\} = 0, \qquad (3.9)$$

where $x_i$ denotes the $i$th observation $(d_i, g_i, e_i)$.

To summarize the description of the estimator, we outline the algorithm here: 

Step 1. Pick a starting value $\tilde\beta$ that is root-$N$ consistent.

Step 2. Solve for $\hat p^t_D(0)$ and $\hat p^t_D(1) = 1 - \hat p^t_D(0)$ from (3.3).

Step 3. Obtain $\hat b_0$ and $\hat b_1$ from (3.4) and (3.5).

Step 4. Obtain $\hat c_0 - \hat c_1$ from (3.6) and (3.7), (3.8).

Step 5. Calculate $\hat S_{\mathrm{eff}}$ using (3.2) and obtain $\hat\beta$ from solving (3.9).
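
As an illustration of Step 2 only, the sketch below solves (3.3) with $d = 0$ for $p^t_D(0)$ by bisection, using $p^t_D(1) = 1 - p^t_D(0)$. It is a minimal sketch under stated assumptions rather than the implementation used for the results reported here: the gene is taken to be dichotomous with $\Pr(G = 1) = \beta_4$, so the integral over $g$ is a two-term sum, and the function names (`prob_d_given_e`, `residual_33`, `solve_true_control_rate`) and the choice of bisection are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_d_given_e(d, e, beta):
    """Sum over g of q(g; beta_4) * H(d, g, e; beta) for a gene with G in {0, 1}."""
    beta_c, beta_1, beta_2, beta_3, beta_4 = beta
    total = 0.0
    for g, q_g in ((0, 1.0 - beta_4), (1, beta_4)):
        h1 = expit(beta_c + beta_1 * g + beta_2 * e + beta_3 * g * e)  # H(1, g, e)
        total += q_g * (h1 if d == 1 else 1.0 - h1)
    return total


def residual_33(p0, e_values, n0, n1, beta):
    """Right-hand side of (3.3) with d = 0 minus N_0 / N, at p_D^t(0) = p0."""
    total = 0.0
    for e in e_values:
        w0 = n0 * prob_d_given_e(0, e, beta) / p0
        w1 = n1 * prob_d_given_e(1, e, beta) / (1.0 - p0)
        total += w0 / (w0 + w1)
    return total / len(e_values) - n0 / (n0 + n1)


def solve_true_control_rate(e_values, n0, n1, beta, tol=1e-12):
    """Bisection for p_D^t(0); the residual is decreasing in p0 on (0, 1)."""
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual_33(mid, e_values, n0, n1, beta) > 0.0:
            lo = mid  # residual still positive: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta_example = (-3.2, 0.26, 0.1, 0.3, 0.065)
e_example = [0.1 * k for k in range(1, 101)]  # placeholder environment values
p0_hat = solve_true_control_rate(e_example, 50, 50, beta_example)
print(p0_hat, 1.0 - p0_hat)
```

A quick sanity check on this sketch: if every observed $e_i$ equals the same value $e^*$, the solution reduces to $\sum_g q(g) H(0, g, e^*)$, the control probability at $e^*$, regardless of $N_0$ and $N_1$.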

It is worth pointing out that in order to carry out Step 1, we have used a vital assumption
that a root-$N$ consistent starting value $\tilde\beta$ exists. Fortunately, the existence of $\tilde\beta$ is equivalent to
the identifiability of $\beta$ and is already well established in Chatterjee and Carroll (2005).
The starting value used there, or in Spinka et al. (2005), can be used to obtain the initial
estimator $\tilde\beta$. Our algorithm here does not require an iteration of Steps 2-5 upon each
update of $\tilde\beta$. However, in practice, a more accurate $\tilde\beta$ can improve the final estimate $\hat\beta$
significantly, hence iterations are almost always implemented.

3.3. Semiparametric efficiency 

If we could use the exact $p^t_D(d; \beta)$ and $a(0; \beta) - a(1; \beta)$ in (3.9), then, according to
Section 3.1, the resulting estimator for $\beta$ would be an efficient estimator, with estimation
variance $V = E(S_{\mathrm{eff}} S_{\mathrm{eff}}^T)^{-1}$. To first order, $V$ can be approximated using
$\{N^{-1} \sum_{i=1}^{N} \hat S_{\mathrm{eff}}(x_i; \hat\beta) \hat S_{\mathrm{eff}}^T(x_i; \hat\beta)\}^{-1}$, where $\hat\beta$ solves (3.9).

We claim that using the estimated $\hat S_{\mathrm{eff}}$ as in (3.9), we obtain an estimating equation
that yields the same estimator for $\beta$ as using $S_{\mathrm{eff}}$, in terms of its first order asymptotic
properties.

Theorem 1. The algorithm in Section 3.2 yields a semiparametric efficient estimator
for $\beta$. That is,

$$\sqrt{N} (\hat\beta - \beta_0) \to \mathrm{Normal}\{0, \mathrm{var}(S_{\mathrm{eff}})^{-1}\}$$

in distribution when $N \to \infty$ and $N_1/N_0$ is fixed.

The proof of the theorem contains two main steps. In the first step, we show the
semiparametric efficiency of the estimator if the observations had been i.i.d. In the second
step, we proceed to show the efficiency in the case-control study using results in Section 2.
Rather complex algebra needs to be employed in the first step. The proof also involves
a split of the data in the final estimation of $\beta$, and in estimating $p^t_D(d)$ and $a(0) - a(1)$,
mainly for technical convenience. The details of the proof appear in the Appendix.

Table 1. Simulation results for the two experiments, each with two different sets of parameter
values, representing uncommon (upper-left) and common (upper-right) gene mutation, and
homogeneous (lower-left) and diversified (lower-right) gene expression levels. 'true' is the true
value of $\beta$, 'est' is the average of the estimated $\beta$, 'sd' is the sample standard deviation and
'sd-hat' is the average of the estimated standard deviation

Experiment 1
          beta_c   beta_1   beta_2   beta_3   beta_4  |   beta_c   beta_1   beta_2   beta_3   beta_4
true     -3.2000   0.2600   0.1000   0.3000   0.0650  |  -3.4500   0.2600   0.1000   0.3000   0.2600
est      -3.8925   0.2498   0.0995   0.3101   0.0649  |  -3.9263   0.2618   0.0994   0.2998   0.2610
sd        1.6390   0.3110   0.0359   0.1226   0.0111  |   1.3958   0.2196   0.0445   0.0783   0.0229
sd-hat    1.6285   0.3236   0.0364   0.1192   0.0116  |   1.2534   0.1956   0.0422   0.0723   0.0207

Experiment 2
          beta_c   beta_1   beta_2   beta_3   beta_4  |   beta_c   beta_1   beta_2   beta_3   beta_4
true     -3.2000   0.2600   0.1000   0.3000   0.3000  |  -3.7300   0.2600   0.1000   0.3000   1.0000
est      -3.3128   0.2553   0.0993   0.3126   0.2999  |  -3.7442   0.2589   0.0995   0.3053   0.9986
sd        0.7815   0.1624   0.0352   0.0750   0.0101  |   0.2906   0.0685   0.0442   0.0405   0.0378
sd-hat    0.7969   0.1663   0.0358   0.0789   0.0101  |   0.2859   0.0676   0.0439   0.0402   0.0373

4. Numerical examples 

We conducted a small simulation study to demonstrate the performance of the estimator.
In the first experiment, we generated 500 cases and 500 controls, where the true
environment element $E$ is $\min(10, X)$ and $X$ is generated from a log-normal distribution
with mean 0 and variance 1. A dichotomous model of the gene is used, where $G = 1$ with
probability $\beta_4$ and $G = 0$ with probability $1 - \beta_4$. This kind of model for $q(g, \beta_4)$ can
represent the presence/absence of a certain gene mutation. We used two different sets of
values for $\beta$: the first set is $\beta = (-3.45, 0.26, 0.1, 0.3, 0.26)^T$, where $\beta_4 = 0.26$ represents
a relatively common mutation; the second set is $\beta = (-3.2, 0.26, 0.1, 0.3, 0.065)^T$, where
$\beta_4 = 0.065$ represents a very rare mutation. In both sets, the true parameters are chosen
so that the model results in a population disease rate $p^t_D(1) \approx 5\%$. The simulation results
are presented in the upper half of Table 1.
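
The data-generating step of the first experiment can be sketched as follows. This is an illustrative sketch, not the code used to produce Table 1. It assumes that the log-normal parameters (mean 0, variance 1) refer to the logarithm of $X$, and it draws cases and controls by sampling from the general population until both quotas are filled; the function names and seeds are hypothetical.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_person(beta, rng):
    """Draw one (d, g, e) from the general population of Experiment 1."""
    beta_c, beta_1, beta_2, beta_3, beta_4 = beta
    # Assumption: log X ~ Normal(0, 1); E is X truncated at 10.
    e = min(10.0, rng.lognormvariate(0.0, 1.0))
    g = 1 if rng.random() < beta_4 else 0
    p_disease = expit(beta_c + beta_1 * g + beta_2 * e + beta_3 * g * e)
    d = 1 if rng.random() < p_disease else 0
    return d, g, e


def population_disease_rate(beta, n=100000, seed=0):
    """Monte Carlo estimate of the population disease rate p_D^t(1)."""
    rng = random.Random(seed)
    return sum(draw_person(beta, rng)[0] for _ in range(n)) / n


def case_control_sample(beta, n_cases=500, n_controls=500, seed=1):
    """Keep drawing from the population until both quotas are filled."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        person = draw_person(beta, rng)
        if person[0] == 1 and len(cases) < n_cases:
            cases.append(person)
        elif person[0] == 0 and len(controls) < n_controls:
            controls.append(person)
    return cases + controls


RARE = (-3.2, 0.26, 0.1, 0.3, 0.065)
COMMON = (-3.45, 0.26, 0.1, 0.3, 0.26)
sample = case_control_sample(RARE)
print(len(sample), population_disease_rate(RARE), population_disease_rate(COMMON))
```

Under the stated assumption on the log-normal parameters, both parameter sets give a simulated population disease rate close to the 5% quoted above.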

The second experiment differs from the first one in its assumption on $q(g, \beta_4)$. Here, we
model $q(g, \beta_4)$ with a Laplace distribution with variance $\beta_4$. This kind of model is typically
used to model the gene expression level. To maintain an approximate 5% disease rate
in the population, we used $\beta = (-3.2, 0.26, 0.1, 0.3, 0.3)^T$ and $\beta = (-3.73, 0.26, 0.1, 0.3, 1)^T$
as the true parameter values. Again, in the first set, $\beta_4 = 0.3$ represents a small variation
in the population distribution for the gene expression levels, resulting in a more homogeneous
population in terms of this gene. In the second set, $\beta_4 = 1$ represents a larger
variation, so the population is more diversified. The simulation results are presented in
the lower half of Table 1. In both experiments, 1000 simulations are implemented.

From Table 1, it is clear that the estimator for $\beta$ is consistent in all four situations
and the estimated standard deviation approximates the true one rather well. It is worth
noting that the first experiment repeats the setting of Chatterjee and
Carroll (2005) and we observe very similar results. Specifically, for $\beta_1, \beta_2, \beta_3, \beta_4$ in the
upper-left table, their results for "sd" are 0.322, 0.037, 0.128, 0.0122, respectively, and
those in the upper-right table are 0.198, 0.043, 0.075 and 0.0273, respectively. Although
some numerical improvement can be observed in certain parameters (for example, $\beta_4$),
this may be a result of finite-sample performance and numerical issues. We conjecture that
the estimator in Chatterjee and Carroll (2005) is equivalent to the method proposed
here, and hence is also efficient, although a rigorous proof is beyond the scope of this paper.
It is also worth noting that the estimation of $\beta_c$ is more difficult than that of the remaining
components of $\beta$, in that the estimate has large variability. This is especially prominent
in the discrete model setting for $q(g)$. Indeed, the estimation result for $\beta_c$ has not been
reported elsewhere and, without the gene-environment independence, $\beta_c$ is known to be
unidentifiable (Prentice and Pyke (1979)). This provides an intuitive explanation for the
performance of $\hat\beta_c$ we observe. The set of estimating equations is solved using a standard
Newton-Raphson algorithm.
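The text does not give implementation details for the Newton-Raphson step, so the following is only a generic sketch of how a vector estimating equation $\sum_i S(x_i; \beta) = 0$ might be solved. The names `newton_raphson` and `solve_linear` are hypothetical, the Jacobian is approximated by forward differences (an assumption, an analytic derivative could be used instead), and the score function in the usage line is a toy system rather than $S_{\mathrm{eff}}$.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a must be nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def newton_raphson(score, beta0, tol=1e-10, max_iter=50, h=1e-6):
    """Iterate beta <- beta - J^{-1} score(beta) with a finite-difference Jacobian J."""
    beta = list(beta0)
    p = len(beta)
    for _ in range(max_iter):
        s = score(beta)
        if max(abs(v) for v in s) < tol:
            break
        jac = [[0.0] * p for _ in range(p)]
        for j in range(p):
            bumped = beta[:]
            bumped[j] += h
            s_h = score(bumped)
            for i in range(p):
                jac[i][j] = (s_h[i] - s[i]) / h
        step = solve_linear(jac, s)
        beta = [b - d for b, d in zip(beta, step)]
    return beta


# Toy estimating equation with root (sqrt(2), sqrt(2)).
root = newton_raphson(lambda b: [b[0] ** 2 - 2.0, b[1] - b[0]], [1.0, 0.0])
```

In the setting of this paper, `score` would return the summed efficient score evaluated at the current value of $\beta$.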

5. Conclusion 

Semiparametric modeling and estimation to study the occurrence of a disease in rela- 
tion to gene and environment has attracted much interest recently. However, despite 
the various estimators proposed in the literature, very little is understood in terms of 
the efficiency of the estimators. This is partly due to the fact that most estimators are 
constructed in rather ingenious ways, instead of following the standard lines of semi- 
parametric theory. The other reason is that most such problems are set in a case-control 
design, which violates the i.i.d. assumption for standard semiparametric theory. 

Instead of rederiving the whole semiparametric theory under non-i.i.d. samples, we 
argue that case-control data can be treated as if they were i.i.d. data and the stan- 
dard semiparametric efficiency theory will still apply. The equivalence of the first order 
asymptotic theory shown in this article is a new contribution. The argument is based on 
rather elementary statistics without involving advanced knowledge or highly specialized 
techniques. 

The establishment of the equivalence of the semiparametric efficiency between
i.i.d. data and case-control data allows us to carry out the estimation using standard,
well-established semiparametric theory. However, these standard analyses are performed
under a hypothetical population of interest, hence the detailed derivation often requires
special treatment, something which has not previously appeared in the literature. Under
the gene-environment independence assumption, we are able to explicitly construct a
novel semiparametric estimator and show its efficiency. A special feature of this estimator
is that we never attempted to estimate the infinite-dimensional nuisance parameter $\eta$
itself, nor did we posit a model, true or false, for it. Rather, we avoided its estimation
and instead approximated quantities that rely on it. On the one hand, this enables us
to carry out the estimation rather easily; on the other hand, some asymptotic properties
have to be rederived because any result that relies on the convergence properties of the
nuisance parameter itself can no longer be used.

Finally, our simulation results support the theory we developed, in both discrete and 
continuous gene distribution cases. Our simulation results in the discrete gene model are 
very similar to those of Chatterjee and Carroll (2005), which leads us to believe that 
their estimator is also efficient. A demonstration of this aspect would be an interesting 
direction for future work. The programming of the method in Chatterjee and Carroll 
may seem easier. However, if the two methods are indeed equivalent, then the projection 
step in the current method should be equivalent to the profiling step in Chatterjee and 
Carroll, hence the computational effort and intensity should be equivalent. Although we 
did not further expand our estimator to stratified case-control data, the method is clearly 
applicable there as well. 



Appendix 

The derivation of $S_{\mathrm{eff}}$

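We will use $S_{\mathrm{eff}}$ to construct our estimating equation. We calculate $S_{\mathrm{eff}}$ by projecting
the score functions with respect to the parameters of interest $\beta_c, \beta_1, \beta_2, \beta_3, \beta_4$ onto the
orthogonal complement of the nuisance tangent space. We first derive the score functions
$S_\beta = \partial \log p(g, e, d; \beta, \eta)/\partial\beta$. Straightforward calculation shows that the score function
$S_\beta = (S_1^T, S_2^T)^T$, where

$$S_1 = (m'_{\beta_c}, m'_{\beta_1}, m'_{\beta_2}, m'_{\beta_3})^T \left(d - 1 + \frac{1}{1 + e^{m}}\right) - E\left\{(m'_{\beta_c}, m'_{\beta_1}, m'_{\beta_2}, m'_{\beta_3})^T \left(d - 1 + \frac{1}{1 + e^{m}}\right) \Big| d\right\},$$

$$S_2 = \frac{q'_{\beta_4}(g, \beta_4)}{q(g, \beta_4)} - E\left\{\frac{q'_{\beta_4}(g, \beta_4)}{q(g, \beta_4)} \Big| d\right\}.$$

Here, $m'_*$ and $q'_*$ represent partial derivatives with respect to $*$. Note that, in general, $S_\beta$
can be written as $S_\beta = S - E(S|d)$.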

We next derive the nuisance tangent space $\Lambda$ and its orthogonal complement $\Lambda^\perp$.
Inserting the form of $p_D^t(d; \beta, \eta)$ into (3.1), replacing $\eta(e)$ by an arbitrary submodel
$p_E^t(e; \gamma)$ and taking the derivative of $\log p(g, e, d; \beta, \gamma)$ with respect to $\gamma$, we obtain
$\partial \log p(g, e, d; \beta, \gamma)/\partial\gamma = \partial \log p_E^t(e; \gamma)/\partial\gamma - E\{\partial \log p_E^t(e; \gamma)/\partial\gamma \,|\, d\}$. Now, recognizing
that $\partial \log p_E^t(e; \gamma)/\partial\gamma$ for an arbitrary submodel can yield an arbitrary function of $e$
with mean zero calculated under the true $\eta(e)$, we obtain the nuisance tangent space:

$$\Lambda = [h(e) - E\{h(e)|d\}: \forall h(e) \text{ such that } E^t(h) = 0] = [h(e) - E\{h(e)|d\}: \forall h(e)],$$
$$\Lambda^\perp = [h(g, e, d): E(h|e) = E\{E(h|d)|e\}].$$

Here, $E^t$ stands for an expectation calculated with respect to the true population distribution.
The two expressions for $\Lambda$ coincide because adding a constant to $h$ leaves $h(e) - E\{h(e)|d\}$
unchanged. The second expression is more convenient because it allows $h(e)$ to be an
arbitrary function of $e$, hence this is the form of $\Lambda$ that we will use.
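The characterization of the orthogonal complement can be checked directly. The following short derivation is a sketch added for illustration; it uses only the tower property of conditional expectation, with $f(e)$ denoting an arbitrary function of $e$ and $h = h(g, e, d)$ an arbitrary candidate element.

```latex
\begin{align*}
E\bigl[h\,\{f(e) - E(f \mid d)\}\bigr]
  &= E\{f(e)\,E(h \mid e)\} - E\{E(h \mid d)\,E(f \mid d)\} \\
  &= E\{f(e)\,E(h \mid e)\} - E\{E(h \mid d)\,f(e)\} \\
  &= E\bigl(f(e)\,\bigl[E(h \mid e) - E\{E(h \mid d) \mid e\}\bigr]\bigr).
\end{align*}
```

The last expression vanishes for every $f(e)$ exactly when $E(h|e) = E\{E(h|d)|e\}$, which is the stated condition.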

Having obtained $S_\beta$ and the spaces $\Lambda$ and $\Lambda^\perp$, we can proceed to derive the efficient
score function $S_{\mathrm{eff}} = \Pi(S_\beta|\Lambda^\perp)$. If we let $\Pi(S_\beta|\Lambda) = f(e) - E(f|d)$, then
$S_{\mathrm{eff}} = S_\beta - f(e) + E(f|d) = S - E(S|d) - f(e) + E(f|d)$.

We now modify the expression of $S_{\mathrm{eff}}$ to facilitate its actual computation. Letting
$a(d) = E(f|d) - E(S|d)$, we can thus write $S_{\mathrm{eff}} = S - f + a(d)$. Note that $S$ does not depend
on $\eta$ and $a(d)$ is either $a(1)$ or $a(0)$. In addition, because $S_{\mathrm{eff}}$ lies in $\Lambda^\perp$, we have
$E(S_{\mathrm{eff}}|e) = E\{E(S_{\mathrm{eff}}|d)|e\}$. This is equivalent to

$$E(S|e) - E\{E(S|d)|e\} - f(e) + E\{E(f|d)|e\} = E[E\{S - E(S|d)|d\} - E\{f - E(f|d)|d\}|e] = 0,$$

which, in turn, is equivalent to

$$E(S|e) = f + E\{E(S|d)|e\} - E\{E(f|d)|e\} = f - E\{a(d)|e\} = f - \frac{\sum_d \int a(d) N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)/p_D^t(d)}{\sum_d \int N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)/p_D^t(d)}.$$

Let

$$v(e, d) = N_d \int q(g, \beta_4) H(d, g, e)\,d\mu(g)/p_D^t(d) = p_{E,D}(e, d) N \eta^{-1}(e)$$

and

$$w(e, d) = v(e, d)/\{v(e, 0) + v(e, 1)\}. \qquad \text{(A.1)}$$

We have

$$E(S|e) = f - a(0)v(e, 0)/\{v(e, 0) + v(e, 1)\} - a(1)v(e, 1)/\{v(e, 0) + v(e, 1)\} = f - a(0)w(e, 0) - a(1)w(e, 1)$$

or $f = E(S|e) + a(0)w(e, 0) + a(1)w(e, 1)$. Consequently,

$$S_{\mathrm{eff}} = S - E(S|e) - a(0)w(e, 0) - a(1)w(e, 1) + a(d) = S - E(S|e) + (-1)^d \{a(0) - a(1)\} w(e, 1 - d).$$
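The last equality can be verified case by case. The following check is an illustrative sketch that uses only the identity $w(e, 0) + w(e, 1) = 1$, which follows from (A.1).

```latex
\begin{align*}
d = 0:\quad & -a(0)w(e,0) - a(1)w(e,1) + a(0)
   = a(0)\{1 - w(e,0)\} - a(1)w(e,1)
   = \{a(0) - a(1)\}\,w(e,1), \\
d = 1:\quad & -a(0)w(e,0) - a(1)w(e,1) + a(1)
   = -a(0)w(e,0) + a(1)\{1 - w(e,1)\}
   = -\{a(0) - a(1)\}\,w(e,0).
\end{align*}
```

Both cases agree with $(-1)^d \{a(0) - a(1)\} w(e, 1 - d)$.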

Proof of Theorem 1 

To simplify notation, we denote $\hat\alpha = \hat p_D^t(0)/\hat p_D^t(1)$, $\alpha = p_D^t(0)/p_D^t(1)$, $\delta(\alpha) = a\{0; p_D^t(d)\} - a\{1; p_D^t(d)\}$,
$\hat\delta(\alpha) = \hat a\{0; p_D^t(d)\} - \hat a\{1; p_D^t(d)\}$ and $\hat\delta(\hat\alpha) = \hat a\{0; \hat p_D^t(d)\} - \hat a\{1; \hat p_D^t(d)\}$.
Suppose we randomly partition the data into two groups: group 1 has m observations
and group 2 has n observations. Here, m = N - 9 , n = N — m. We use the first group to 
obtain $\hat\alpha$ and $\hat\delta(\hat\alpha)$, then use the second group to form the following estimating equation
to estimate $\beta$:

$$\sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \beta, \hat\alpha, \hat\delta(\hat\alpha)\} = 0.$$

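We will first show that the resulting estimator satisfies $n^{1/2}(\hat\beta - \beta_0) \to N(0, V)$ in distribution
when $N \to \infty$.

The proof splits into several steps. First, obviously, $\hat\alpha - \alpha = O_p(m^{-1/2})$ and
$\hat\delta(\hat\alpha) - \delta(\hat\alpha) = O_p(m^{-1/2})$, as long as a root-$N$-consistent estimate of $\beta$ is inserted in the
calculation of these quantities. A standard expansion yields

$$\begin{aligned}
0 &= \sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \hat\beta, \hat\alpha, \hat\delta(\hat\alpha)\} \\
  &= \sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \beta_0, \hat\alpha, \hat\delta(\hat\alpha)\} + \sum_{i=1}^{n} \frac{\partial}{\partial\beta} S_{\mathrm{eff}}\{x_i; \beta^*, \hat\alpha, \hat\delta(\hat\alpha)\}(\hat\beta - \beta_0) \\
  &= \sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \beta_0, \hat\alpha, \hat\delta(\hat\alpha)\} + n\left\{E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\beta}\right) + o_p(1)\right\}(\hat\beta - \beta_0),
\end{aligned}$$

which can be rewritten as

$$\begin{aligned}
\left\{E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\beta}\right) + o_p(1)\right\} n^{1/2}(\hat\beta - \beta_0)
  &= -n^{-1/2} \sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \beta_0, \hat\alpha, \hat\delta(\hat\alpha)\} \\
  &= -n^{-1/2} \sum_{i=1}^{n} \left[S_{\mathrm{eff}}\{x_i; \beta_0, \hat\alpha, \delta(\hat\alpha)\} + (-1)^{d_i}\{\hat\delta(\hat\alpha) - \delta(\hat\alpha)\} w(e_i, 1 - d_i, \hat\alpha)\right].
\end{aligned}$$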

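The last equality uses the form of $S_{\mathrm{eff}}$ in (3.2) and the fact that $S$, $E(S|e)$ and $w$ do not
depend on $a$. Because $\hat\delta(\hat\alpha) - \delta(\hat\alpha) = O_p(m^{-1/2}) = o_p(1)$ and

$$E\{(-1)^{d_i} w(e_i, 1 - d_i, \alpha)\} = \int \sum_{d=0}^{1} (-1)^d \frac{N p_{E,D}(e, 1 - d; \alpha)\,\eta^{-1}(e)}{v(e, 0; \alpha) + v(e, 1; \alpha)}\, p_{E,D}(e, d; \alpha)\,d\mu(e) = 0,$$

we actually have

$$\begin{aligned}
\left\{E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\beta}\right) + o_p(1)\right\} n^{1/2}(\hat\beta - \beta_0)
  &= -n^{-1/2} \sum_{i=1}^{n} S_{\mathrm{eff}}\{x_i; \beta_0, \hat\alpha, \delta(\hat\alpha)\} + o_p(1) \\
  &= -n^{-1/2} \sum_{i=1}^{n} \left\{S_{\mathrm{eff}}(x_i) + \frac{\partial S_{\mathrm{eff}}(x_i)}{\partial\alpha}(\hat\alpha - \alpha) + \frac{1}{2}\frac{\partial^2 S_{\mathrm{eff}}(x_i; \alpha^*)}{\partial\alpha^2}(\hat\alpha - \alpha)^2\right\} + o_p(1).
\end{aligned}$$

In addition, $(\hat\alpha - \alpha)^2 = O_p(m^{-1}) = o_p(n^{-1/2})$, so

$$\left\{E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\beta}\right) + o_p(1)\right\} n^{1/2}(\hat\beta - \beta_0) = -n^{-1/2} \sum_{i=1}^{n} \left\{S_{\mathrm{eff}}(x_i) + \frac{\partial S_{\mathrm{eff}}(x_i)}{\partial\alpha}(\hat\alpha - \alpha)\right\} + o_p(1).$$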

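We now proceed to examine $\partial S_{\mathrm{eff}}(x_i)/\partial\alpha$ by examining each term in (3.2). $S$ is free of $\alpha$.
As a function of $\alpha$, we already have

$$\begin{aligned}
b_5(e) = E(S|e) &= \frac{\sum_d \int S N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)/p_D^t(d)}{\sum_d \int N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)/p_D^t(d)} \\
  &= \frac{\int S N_0 q H_0\,d\mu(g) + \alpha \int S N_1 q H_1\,d\mu(g)}{\int N_0 q H_0\,d\mu(g) + \alpha \int N_1 q H_1\,d\mu(g)} = \frac{u_2(e, 0) + \alpha u_2(e, 1)}{u_1(e, 0) + \alpha u_1(e, 1)},
\end{aligned}$$

where we define $u_1(e, d) = \int N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)$ and $u_2(e, d) = \int S N_d q(g, \beta_4) H(d, g, e)\,d\mu(g)$.
Using this notation,

$$\frac{\partial b_5}{\partial\alpha} = \frac{u_2(e, 1)u_1(e, 0) - u_2(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2},$$

$$w(e, 0) = \frac{u_1(e, 0)}{u_1(e, 0) + \alpha u_1(e, 1)}, \qquad w(e, 1) = \frac{\alpha u_1(e, 1)}{u_1(e, 0) + \alpha u_1(e, 1)}.$$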
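For the dichotomous gene model, the integral over $g$ in $u_1(e, d)$ reduces to a two-term sum, so the weights $w(e, d)$ are cheap to evaluate. The sketch below illustrates this; the function names `H`, `u1` and `w` mirror the notation of the text but are otherwise hypothetical, the linear predictor $\beta_c + \beta_1 g + \beta_2 e + \beta_3 g e$ is an assumption, and the value $\alpha = 19$ in the usage lines simply corresponds to a 5% population disease rate.

```python
import math


def H(d, g, e, beta):
    """Logistic disease model P(D = d | g, e); the linear predictor is an assumption."""
    b_c, b_1, b_2, b_3, _ = beta
    lin = b_c + b_1 * g + b_2 * e + b_3 * g * e
    p1 = 1.0 / (1.0 + math.exp(-lin))
    return p1 if d == 1 else 1.0 - p1


def u1(e, d, beta, n_d):
    """u_1(e, d) = N_d * sum_g q(g, beta_4) H(d, g, e) for a gene taking values 0 and 1."""
    b_4 = beta[4]
    q = {0: 1.0 - b_4, 1: b_4}
    return n_d[d] * sum(q[g] * H(d, g, e, beta) for g in (0, 1))


def w(e, d, beta, n_d, alpha):
    """w(e, 0) = u_1(e,0) / {u_1(e,0) + alpha u_1(e,1)}; w(e, 1) is the complement."""
    a0 = u1(e, 0, beta, n_d)
    a1 = alpha * u1(e, 1, beta, n_d)
    return (a0 if d == 0 else a1) / (a0 + a1)


beta = (-3.45, 0.26, 0.1, 0.3, 0.26)
n_d = (500, 500)            # (N_0, N_1): numbers of controls and cases
alpha = 0.95 / 0.05         # p_D^t(0) / p_D^t(1) for a 5% disease rate
weights = [(w(e, 0, beta, n_d, alpha), w(e, 1, beta, n_d, alpha)) for e in (0.0, 1.0, 10.0)]
```

For a continuous gene model such as the Laplace distribution, the sum over `g` would be replaced by numerical integration.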
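Similarly to the calculation of $b_0, b_1$, we also have that for any function $u$,

$$\begin{aligned}
E(u|d; \alpha) &= \int \frac{p_E(e) \int u N_d q(g) H(d, g, e)\,d\mu(g)/p_D^t(d)}{\sum_d \int N_d q(g) H(d, g, e)\,d\mu(g)/p_D^t(d)}\,d\mu(e) \bigg/ \int \frac{p_E(e) \int N_d q(g) H(d, g, e)\,d\mu(g)/p_D^t(d)}{\sum_d \int N_d q(g) H(d, g, e)\,d\mu(g)/p_D^t(d)}\,d\mu(e) \\
  &= \int \frac{p_E(e) \int u N_d q(g) H(d, g, e)\,d\mu(g)/p_D^t(d)}{\sum_d u_1(e, d)/p_D^t(d)}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, d)/p_D^t(d)}{\sum_d u_1(e, d)/p_D^t(d)}\,d\mu(e),
\end{aligned}$$

$$E(u|0; \alpha) = \int \frac{p_E(e) \int u N_0 q(g) H(0, g, e)\,d\mu(g)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e),$$

$$E(u|1; \alpha) = \int \frac{p_E(e) \int u N_1 q(g) H(1, g, e)\,d\mu(g)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 1)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e).$$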

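These relations lead to

$$b_0 = \int \frac{p_E(e) u_2(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e),$$

$$b_1 = \int \frac{p_E(e) u_2(e, 1)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 1)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e),$$

$$\begin{aligned}
b_2 = E\{E(S|e)|0; \alpha\} &= \int \frac{p_E(e) \int E(S|e) N_0 q(g) H(0, g, e)\,d\mu(g)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \\
  &= \int \frac{p_E(e) E(S|e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \\
  &= \int \frac{p_E(e) u_1(e, 0)\{u_2(e, 0) + u_2(e, 1)\alpha\}}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e),
\end{aligned}$$

$$\begin{aligned}
b_3 = E\{w(e, 0)|D = 0\} &= \int \frac{p_E(e) \int w(e, 0) N_0 q(g) H(0, g, e)\,d\mu(g)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) \\
  &= \int \frac{p_E(e) u_1^2(e, 0)}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e) \bigg/ \int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e).
\end{aligned}$$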
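Consequently, we obtain

$$\begin{aligned}
S_{\mathrm{eff}}(0) &= S - b_5(e) + \left\{b_1 - b_0 + \frac{b_2 - b_0 b_3 - b_1(1 - b_3)}{1 - b_3}\right\} \frac{\alpha u_1(e, 1)}{u_1(e, 0) + \alpha u_1(e, 1)} \\
  &= S - b_5(e) + \left(\frac{b_2 - b_0}{1 - b_3}\right) \frac{\alpha u_1(e, 1)}{u_1(e, 0) + \alpha u_1(e, 1)}, \\
S_{\mathrm{eff}}(1) &= S - b_5(e) - \left(\frac{b_2 - b_0}{1 - b_3}\right) \frac{u_1(e, 0)}{u_1(e, 0) + \alpha u_1(e, 1)}, \\
\frac{\partial S_{\mathrm{eff}}(0)}{\partial\alpha} &= -b_5(e)' + \left(\frac{b_2 - b_0}{1 - b_3}\right)' \frac{\alpha u_1(e, 1)}{u_1(e, 0) + \alpha u_1(e, 1)} + \left(\frac{b_2 - b_0}{1 - b_3}\right) \frac{u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2}, \\
\frac{\partial S_{\mathrm{eff}}(1)}{\partial\alpha} &= -b_5(e)' - \left(\frac{b_2 - b_0}{1 - b_3}\right)' \frac{u_1(e, 0)}{u_1(e, 0) + \alpha u_1(e, 1)} + \left(\frac{b_2 - b_0}{1 - b_3}\right) \frac{u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2}.
\end{aligned}$$

Since $S$ does not contain $\alpha$, $\partial S_{\mathrm{eff}}/\partial\alpha$ is a function of $(e, d)$ only. Because $p_{E,D}(e, d) =
\eta(e)u_1(e, d)/\{N p_D^t(d)\}$, we have $p_{E,D}(e, 0) = (1 + \alpha)\eta(e)u_1(e, 0)/(N\alpha)$, $p_{E,D}(e, 1) = (1 +
\alpha)\eta(e)u_1(e, 1)/N$ and $p_E(e) = (1 + \alpha)\eta(e)\{u_1(e, 0) + \alpha u_1(e, 1)\}/(N\alpha)$. Combining these
results, we have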



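$$\begin{aligned}
E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\alpha}\right) &= -E\{b_5(e)'\} + \left(\frac{b_2 - b_0}{1 - b_3}\right) E\left[\frac{u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2}\right] \\
  &= E\left[\frac{-u_2(e, 1)u_1(e, 0) + u_2(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2}\right] + \left(\frac{b_2 - b_0}{1 - b_3}\right) E\left[\frac{u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + \alpha u_1(e, 1)\}^2}\right].
\end{aligned}$$

Here, the term involving the derivative of $(b_2 - b_0)/(1 - b_3)$ drops out because
$E\{(-1)^d w(e, 1 - d)\} = 0$. Plugging in the expressions for $b_0, b_2, b_3$, we obtain

$$\begin{aligned}
\frac{b_2 - b_0}{1 - b_3} &= \frac{\displaystyle\int \frac{p_E(e) u_1(e, 0)\{u_2(e, 0) + u_2(e, 1)\alpha\}}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e) - \int \frac{p_E(e) u_2(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e)}{\displaystyle\int \frac{p_E(e) u_1(e, 0)}{u_1(e, 0) + u_1(e, 1)\alpha}\,d\mu(e) - \int \frac{p_E(e) u_1^2(e, 0)}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e)} \\
  &= \int \frac{\alpha p_E(e)\{u_1(e, 0)u_2(e, 1) - u_1(e, 1)u_2(e, 0)\}}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e) \bigg/ \int \frac{\alpha p_E(e) u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\,d\mu(e) \\
  &= E\left[\frac{u_1(e, 0)u_2(e, 1) - u_1(e, 1)u_2(e, 0)}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\right] \bigg/ E\left[\frac{u_1(e, 0)u_1(e, 1)}{\{u_1(e, 0) + u_1(e, 1)\alpha\}^2}\right];
\end{aligned}$$

thus, we have $E(\partial S_{\mathrm{eff}}/\partial\alpha) = 0$.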
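The cancellation can also be checked numerically. The sketch below is only an illustration: it replaces the integrals over $e$ by sums over a finite grid with probability masses `p_e`, treats $u_2$ as scalar-valued (in the text it is vector-valued, and the identity holds componentwise), and fills $u_1$, $u_2$ with arbitrary random values. The function name `expected_alpha_derivative` is hypothetical.

```python
import random


def expected_alpha_derivative(p_e, u1, u2, alpha):
    """E(dS_eff/d alpha) on a finite grid of e values.

    p_e[k] is the mass of e_k; u1[k] = (u_1(e_k, 0), u_1(e_k, 1)), likewise u2[k].
    """
    den = [a0 + alpha * a1 for a0, a1 in u1]
    norm0 = sum(p * a[0] / dn for p, a, dn in zip(p_e, u1, den))
    b0 = sum(p * s[0] / dn for p, s, dn in zip(p_e, u2, den)) / norm0
    b2 = sum(p * a[0] * (s[0] + alpha * s[1]) / dn ** 2
             for p, a, s, dn in zip(p_e, u1, u2, den)) / norm0
    b3 = sum(p * a[0] ** 2 / dn ** 2 for p, a, dn in zip(p_e, u1, den)) / norm0
    ratio = (b2 - b0) / (1.0 - b3)
    # E{b_5(e)'}: expectation of the derivative of E(S|e) with respect to alpha.
    mean_db5 = sum(p * (s[1] * a[0] - s[0] * a[1]) / dn ** 2
                   for p, a, s, dn in zip(p_e, u1, u2, den))
    # E[u_1(e,0) u_1(e,1) / {u_1(e,0) + alpha u_1(e,1)}^2].
    mean_cross = sum(p * a[0] * a[1] / dn ** 2 for p, a, dn in zip(p_e, u1, den))
    return -mean_db5 + ratio * mean_cross


rng = random.Random(1)
k = 20
raw = [rng.random() + 0.1 for _ in range(k)]
p_e = [r / sum(raw) for r in raw]
u1 = [(rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)) for _ in range(k)]
u2 = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(k)]
value = expected_alpha_derivative(p_e, u1, u2, alpha=19.0)  # zero up to rounding error
```

The result is zero for any positive $u_1$ and any $\alpha > 0$, which reflects that the identity is algebraic and does not depend on the particular gene or disease model.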

The fact that $E(\partial S_{\mathrm{eff}}/\partial\alpha) = 0$, in combination with $\hat\alpha - \alpha = o_p(1)$, yields

$$\left\{E\left(\frac{\partial S_{\mathrm{eff}}}{\partial\beta}\right) + o_p(1)\right\} n^{1/2}(\hat\beta - \beta_0) = -n^{-1/2} \sum_{i=1}^{n} S_{\mathrm{eff}}(x_i) + o_p(1).$$

Thus, we indeed have $n^{1/2}(\hat\beta - \beta_0) \sim N(0, V)$ asymptotically.

In fact, the classical $N^{1/2}(\hat\beta - \beta_0) \sim N(0, V)$ also holds. This is because

$$N^{1/2}(\hat\beta - \beta_0) - n^{1/2}(\hat\beta - \beta_0) = \{(N/n)^{1/2} - 1\}\, n^{1/2}(\hat\beta - \beta_0) \to 0$$

when $N \to \infty$. Thus, our estimator is semiparametric efficient. Because of the equivalence
result developed in Section 2, the estimator is also semiparametric efficient for case-control
data. We split the data set into two groups with sizes $m$ and $n$ for simplicity of
the asymptotic analysis. In reality, one can certainly use the whole data set in each stage
of the estimation.